usb/cdc: fix RP2 USB CDC TX race with cores scheduler#5391
Conversation
ce5c04a to
c059707
Compare
|
@rdon-key thank you for looking into this. Your solution makes sense to me, I just suggested a small correction in form. |
|
@deadprogram |
|
Anyone else have any feedback or comments before merge? |
|
@soypat ? |
fb52e10 to
c690c2c
Compare
| } | ||
| usbcdc.tx.Discard(inflight) | ||
| usbcdc.inflight.Store(0) | ||
| usbcdc.sendFromRing() |
There was a problem hiding this comment.
It is odd to me this is not guarded with a txActive compare-swap. I think my issue with this PR as it stands is it is not clear to me what txActive is guarding. From the looks of it sendFromRing can be called concurrently from kickTx and txhandler and we've added a bunch of logic in sendFromRing to deal with this. I feel like we can maybe keep sendFromRing as a simple function that does what it says and guard it from executing concurrently as I suspect that concurrent execution is not necessary for this to work correctly.
maybe we can even delete kickTx and simply add the guard at the start of sendFromRing and do a simple Store(0) immediately when sendFromRing ends it's process?
There was a problem hiding this comment.
That sounds very reasonable, and less complex. What do you think @rdon-key?
There was a problem hiding this comment.
txActive is not only guarding the lexical execution of sendFromRing. It represents ownership of the asynchronous TX pump.
This is the usual queue + active-flag missed-wakeup pattern: while the active flag is set, producers may enqueue more work but do not start another worker. Therefore, when the current owner is about to go idle, it must clear the active flag and then re-check the queue before returning.
In this case, sendFromRing submits one USB IN packet and then returns, but the TX pump must still be considered active while that packet is in flight. The ownership is continued by txhandler when the TX completion arrives.
That is why txhandler does not acquire txActive with CAS before calling sendFromRing: it is continuing the ownership that was acquired by kickTx.
If we put the CAS at the start of sendFromRing, the completion path would normally see txActive already set and fail to continue the pump. If we clear txActive immediately when sendFromRing returns after submitting a packet, Write could acquire it and call SendUSBInPacket again while the previous packet is still in flight, which is the race this PR is trying to avoid.
The intended ownership model:
- kickTx acquires txActive when the TX pump is idle.
- sendFromRing sends one packet while ownership is held.
- txhandler continues the same ownership across TX completions.
- ownership is released only when the ring appears empty, with a final re-check to avoid a missed wakeup.
Fixes #5377.
This PR fixes a USB CDC TX race on RP2 targets when using
-scheduler=cores.The issue was reproduced on Raspberry Pi Pico / RP2040 with TinyGo 0.41.1.
RP2350 may be affected as well because it shares the RP2 USB backend and the
same USB CDC TX state machine.
Cause
USB CDC TX is driven from two places:
Write()viakickTx()txhandler()The previous code used
inflightboth as:That is not sufficient on multicore targets.
With
-scheduler=cores,Write()may run on one core while the USB TXcompletion handler runs on another core. In that case,
kickTx()can observe theTX path as idle while
txhandler()is still processing a completed packet andchaining the next one.
That can allow concurrent or re-entrant calls into
sendFromRing()/machine.SendUSBInPacket()for the same USB IN endpoint, which can corrupt USBCDC output.
Fix
This PR adds a separate atomic
txActiveflag.txActiverepresents ownership of the USB CDC TX pump.inflightremains onlythe number of bytes currently submitted to the endpoint.
kickTx()now starts the TX pump only if it can acquiretxActivewith CAS.The TX completion handler keeps TX pump ownership while chaining packets, and
ownership is released only when the TX ring is empty.
A final ring re-check is used when releasing ownership to avoid a missed wakeup
if
Write()appends data whiletxActiveis still set.Manual test
Test program:
Commands:
Results:
The
-scheduler=corestest was repeated multiple times without observingdropped or corrupted USB CDC output.